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Abstract 


Background: Genome size is implicated in the form, function, and ecological success of a species. Two principally different mech- 
anisms are proposed as major drivers of eukaryotic genome evolution and diversity: polyploidy (i.e., whole-genome duplication) or 
smaller duplication events and bursts in the activity of repetitive elements. Here, we generated de novo genome assemblies of 17 
caddisflies covering all major lineages of Trichoptera. Using these and previously sequenced genomes, we use caddisflies as a model 
for understanding genome size evolution in diverse insect lineages. 

Results: We detect a ~14-fold variation in genome size across the order Trichoptera. We find strong evidence that repetitive element 
expansions, particularly those of transposable elements (TEs), are important drivers of large caddisfly genome sizes. Using an inno- 
vative method to examine TEs associated with universal single-copy orthologs (i.e., BUSCO genes), we find that TE expansions have a 
major impact on protein-coding gene regions, with TE-gene associations showing a linear relationship with increasing genome size. 
Intriguingly, we find that expanded genomes preferentially evolved in caddisfly clades with a higher ecological diversity (i.e., various 
feeding modes, diversification in variable, less stable environments). 

Conclusion: Our findings provide a platform to test hypotheses about the potential evolutionary roles of TE activity and TE-gene 
associations, particularly in groups with high species, ecological, and functional diversities. 


Keywords: biodiversity; de novo genome assembly, genomics, genomic diversity, gnome duplication, genome size evolution, insects, 
repetitive elements, transposable elements, Trichoptera 


Background 


evolution, it has been regarded as the exception in animals [12, 
13]. However, ancient WGD has been hypothesized to be an impor- 
tant driver of evolution of mollusks (e.g., [14]), amphibians (e.¢g., 
[15, 16]), fish (e.g., [17-19]), and arthropods (e.g., [20-22]), includ- 
ing multiple putative ancient large-scale gene duplications within 
Trichoptera [23]. 

RE expansion is an important driver of genome size varia- 
tion in many eukaryotic genomes [24, 25]. The two major cat- 
egories of REs are tandem repeats (e.g., satellite DNA) and mo- 
bile transposable elements (TEs). TEs are classified into Class I 


Genome size is a fundamental biological character. Studying its 
evolution may potentially lead to a better understanding of the 
origin and underlying processes of the myriad forms and func- 
tions of plants and animals. These diversification processes re- 
main at the core of much biological research. Given their high 
species, ecological, and functional diversities, insects are excel- 
ent models for such research. To date 1,345 insect genome size 
estimates have been published [1], ranging 240-fold from 69 Mb 
in chironomid midges [2] to 16.5 Gb in the mountain grasshop- 











per Podisma pedestris [3]. Genome size variation relates poorly to 
the number of coding genes or the complexity of the organism 
C-value enigma [4-7]), and evolutionary drivers of genome size 
variation remain a topic of ongoing debate (e.g., [8-11]). Two princi- 
pally different mechanisms are proposed as primary drivers of eu- 
karyotic genome size evolution: whole-genome duplication (WGD, 
i.e., polyploidy) or smaller duplication events and expansion of 
repetitive elements (REs [6]). While WGD is ubiquitous in plant 








(retrotransposons: endogenous retroviruses, related long termi- 
nal repeat [LTR], and non-LTR retrotransposons: SINEs [short in- 
terspersed nuclear elements], LINEs [long interspersed nuclear 
elements]) and Class II elements (DNA transposons [26]). In in- 
sects, the known genomic proportion of TEs ranges from 1% in 
the Antarctic midge Belgica antarctica [27] to 65% in the migra- 
tory locust Locusta migratoria [28]. Broad-scale analysis of TE abun- 
dance in insects suggests that some order-specific signatures are 
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present; however, major shifts in TE abundance are also com- 
mon at shallow taxonomic levels [29, 30], including in Trichoptera 
[31]. The movement and proliferation of REs can have deleteri- 
ous consequences on gene function and genome stability [32-36]. 
Moreover, repeat content and abundance can turn over rapidly 
even over short evolutionary time scales (reviewed in [37]). This 
rapid evolution has consequences for genome evolution and spe- 
ciation; e.g., repeat divergence causes genetic incompatibilities 
between even closely related species [38]. However, TEs can also 
be sources of genomic innovation with selective advantages for 
the host [39-44] and they can contribute to global changes in 
gene regulatory networks [45-47]. Investigating RE dynamics in 
diverse clades provides a powerful lens for understanding their 
roles in genome function and evolution. Broad study of RE dy- 
namics in species-rich groups with wide variation in RE activity 
is an important step towards efficiently identifying study systems 
at finer taxonomical scales (natural populations, species com- 
plexes, or recently diverged species) that are ideally suited to ad- 
vance our understanding of molecular and evolutionary mech- 
anisms underlying genome evolution. In addition, by taking this 
biodiversity genomics approach, we can develop new model sys- 
tems and eventually better understand links between environ- 
mental factors, genome size evolution, adaptation, and speciation 
(see [48]). 

With >16,500 species, caddisflies (Trichoptera) are among the 
most diverse of all aquatic insects [49]. Their species richness is 
reflective of their ecological diversity, including, e.g., microhabi- 
tat specialization, a full array of feeding modes, and diverse use 
of underwater silk secretions [50, 51]. An initial comparison of 6 
caddisfly species found wide genome size variation in Trichoptera 
(ranging from 230 Mb to 1.4 Gb). In that study, we hypothesized 
that the observed variation was correlated with caddisfly phy- 
logeny and that TEs contributed to a suborder-specific increase 
of genome size [31]. 

Here, we present a multi-faceted analysis to investigate 
genome size evolution in the order Trichoptera, as an example for 
highly diversified non-model organisms. Specifically, we (i) esti- 
mated genome size for species across the order to explore phy- 
logenetic patterns in the distribution of genome size variation in 
Trichoptera and (ii) generated 17 new Trichoptera genomes to an- 
alyze, in conjunction with 9 existing genomes, the causes (WGD, 
TE expansions) of genome size variation in the evolution of cad- 
disflies. Studying the genomic diversity of this highly diversified 
insect order adds new insights into drivers of genome size evolu- 
tion with potential to shed light on how genome size is linked to 
form, function, and ecology. 



































Data Description 
Genomic resources 


Here, we combined long- and short-read sequencing technologies 
to generate 17 new de novo genome assemblies across a wide taxo- 
nomic range, covering all major lineages of Trichoptera. Details on 
sequencing coverage and assembly strategies are given in Supple- 
mentary Data File $1.2, Supplementary Data File $1.3 and Supple- 
mentary Note 3. To assess quality, we calculated assembly statis- 
tics with QUAST v5.0.2 [52], examined gene completeness with 
BUSCO v5 [53, 54], and screened for potential contamination with 
taxon-annotated GC-coverage (TAGC) plots using BlobTools v1.0 
({55], Supplementary Figs S31-S47). The new genomes are of com- 
parable or better quality than other Trichoptera genomes previ- 














ously reported in terms of BUSCO completeness and contiguity 
Table 1). This study increases the number of assemblies in this 
order from 9 to 26, nearly tripling the number of available cad- 
disfly genomes and thus providing a valuable resource for study- 
ing genomic diversity across this ecologically diverse insect order. 
The annotation of these genomes predicted 6,413-12,927 proteins 
Supplementary Data File $1.2). Most of the annotated proteins 
94.4-98.8%) showed significant sequence similarity to entries in 
the NCBI nr database. GO Distributions were similar to previously 
annotated caddisfly genomes, i.e., the major biological processes 
were cellular and metabolic processes. Catalytic activity was the 
argest subcategory in molecular function, and the cell membrane 
subcategories were the largest cellular component (Supplemen- 
tary Figs S1-S30). This project has been deposited at NCBI under 
BioProject ID: PRINA558902. For accession numbers of individual 
assemblies see Table 1. 

We downloaded existing Trichoptera genomes from GenBank 
62] or Lepbase [58] and used these in conjunction with our newly 
generated genomes to analyze genome size evolution as explained 
in the following sections of this manuscript. 








Flow cytometry 


n addition to genomic sequence data, we used flow cytome- 
try (FCM) to detect genome size variation across the order. Our 
study increased the number of species with available FCM-based 
genome size estimates from 4 [63] to 31. Estimates were submitted 
to the Animal Genome Size Database [1]. 








Analysis 
Genome size evolution in Trichoptera 


On the basis of the genomes of 6 trichopteran species, Olsen et al. 
[31] found a 3-fold suborder-specific increase of genome size and 
hypothesized that genome size variation is correlated with their 
phylogeny. To test this hypothesis, we first reconstructed phyloge- 
netic relationships by analyzing ~2,000 single-copy BUSCO genes 
from the 26 study species (Figs 1 and 2, Supplementary Fig. S48). 
We obtained a molecular phylogeny that was in agreement with 
recent phylogenetic hypotheses ([64], see Supplementary Note 6) 
and that showed that Trichoptera is divided into two suborders: 
Annulipalpia (Figs 1 and 2: Clade A, blue) and Integripalpia (con- 
sisting of basal Integripalpia [Fig. 1: Clade B1-3, light green] and 
infraorder Phryganides [Fig. 1: clade B4, dark green]). Trichopter- 
ans use silk to build diverse underwater structures (see Fig. 1; Sup- 
plementary Note 6, Supplementary Fig. S48). Thus, we refer to An- 
nulipalpia as “fixed retreat- and net-spinners,” to Phryganides (In- 
tegripalpia) as “tube case-builders,” and to basal Integripalpia as 
“cocoon-builders.” 

We used 3 approaches for estimating genome size across 
Trichoptera: k-mer distribution estimates, back-mapping of se- 
quence data to available draft genomes (as described in [65, see 
also 66]), and FCM (Supplementary Note 7, Figs S49-S72, Supple- 
mentary Data File S1.7). FCM estimates can be affected by chro- 
matin condensation, the proportion of cells in GO-G1 phases [67, 
68], and endoreplication in insect cells and tissues [69]. Sequence- 
based estimates can be affected by REs in the genome, result- 
ing in smaller genome size estimates (e.g., [63, 70, 71]), as well 
as by GC-content because sequence library preparation including 
PCR amplification steps is associated with underrepresentation of 
GC- and AT-rich regions [72]. Bland-Altman plots (Supplementary 
Note 8, Supplementary Fig. S73) revealed general agreement of all 
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Figure 1: Ecological diversity (right) and genome size (left) in caddisflies. Phylogenetic relationships derived from ASTRAL-III analyses using single 
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Sericostomatidae). 


3 methods in our study. However, the FCM estimates were gen- 
erally higher compared to the sequence-based estimates (Fig. 1, 
Supplementary Data File S1.7), and, among all 3 approaches, this 
measure is expected to be the most accurate [9]. We observe that 
variation among the methods increased with genome size, indi- 
cating issues potentially caused by repeat content (see section Re- 
peat Dynamics). 

We observed large variation in genome size across the or- 
der. Genome size tends to be lower in fixed retreat- and net- 
spinners and cocoon-builders compared to tube case—builders 
(Fig. 1). Specifically, we observe that genome size varies ~14-fold, 
ranging from 1C = 154 Mb in cocoon-builders (Fig. 1, B1: Hydrop- 
tilidae) to 1C = 2,129 Mb in tube case-builders (Fig. 1, clade B4: 
Limnephilidae). Of the 29 species analyzed by FCM, Halesus digita- 
tus (Fig. 1, clade B4: Limnephilidae, Intergripalpia) possessed the 
largest genome (1C = 2,129 Mb), while the genome of Hydropsy- 
che saxonica (Fig. 1, clade A: Hydropsychidae, fixed retreat- and 
net-spinners) was the smallest (1C = 242 Mb). Genome size esti- 
mates based on sequence-based methods (k-mer—based and back- 
mapping) range from 1C = 154-160 Mb in Agraylea sexmaculata 
(Fig. 1, clade B1: Hydroptilidae, cocoon-builders) to 1C = 1,238- 
1,400 Mb in Sericostoma sp. (Fig. 1, clade B4: Sericostomatidae, tube 
case-builders). 











BUSCO genes. Goeridae, which was not included in the BUSCO gene set, was placed according to [64]. ASTRAL support values (local posterior 
probabilities) >0.9 are given for each node. The placement of Hydroptilidae (clade B1) was ambiguous. Because its placement was poorly supported in 
our analyses, we placed it according to Thomas et al. [64]. Taxa were collapsed to family level. Trichoptera are divided into two suborders: Annulipalpia 
“fixed retreat- and net-spinners,” clade A: blue) and Intergripalpia (clade B: green), which includes basal Integripalpia (“cocoon-builders,” clades 
B1-B3, dark green) and Phryganides or “tube case-builders” (clade B4: light green). “Cocoon-builders” are divided into “purse case-building” (clade B1), 
‘tortoise case-building” (clade B2), and “free-living” (clade B3) families. Genome size estimates based on different methods (Genomescope2: orange, 
Backmap.pl: black, flow cytometry [FCM]: brown) are given for various caddisfly families. Each dot corresponds to a mean estimate of a species. For 
detailed information on the species and number of individuals used in each method see Supplementary Data File $1.7. Colors and clade numbers in 
he phylogenetic tree refer to colored boxes with illustrations. The following species are illustrated by Ralph Holzenthal: a: Hydropsyche sp. 
Hydropsychidae); b: Chimarra sp. (Philopotamidae); C: Stenopsyche sp. (Stenopsychidae); d: Polycentropus sp. (Polycentropodidae); e: Agraylea sp. 
(Hydroptilidae); f: Glossosoma sp. (Glossosomatidae); g: Rhyacophila sp. (Rhyacophilidae); h: Fabria inomata (Phryganeidae); i: Micrasema sp. 
(Brachycentridae); j: Goera fuscula (Goeridae); k: Sphagnophylax meiops (Limnephilidae); 1: Psilotreta sp. (Odontoceridae), m: Grumicha grumicha 











Repeat Dynamics 
Repetitive element abundance and classification 


To understand the structural basis of genome size variation across 
the order Trichoptera we explored RE content. We found that ma- 
jor expansions of transposable elements (TEs) contribute to larger 
genomes in tube case-builders and some cocoon-builders, but 
particularly in tube case-builders with a average of ~600 Mb of 
REs compared to ~138 Mb in fixed retreat- and net-spinners (Fig. 2, 
Supplementary Data File S2.1). LINEs are the most abundant 
classified TEs in cocoon- and tube case-builders and comprise 
>154 Mb on average in tube case-builders, or a mean genome pro- 
portion of 16.9% (range = 5.6-34.7%). This represents a 1.8- and 
2.8-fold increase in genome proportion relative to cocoon-builders 
and fixed retreat— and net-spinners, respectively. The LINE abun- 
dance of >312 Mb in Odontocerum albicorne exceeds the entire as- 
sembly lengths (152-282 Mb) of the 3 smallest genome assemblies 
(Hydropsyche tenuis, Parapsyche elsis, and A. sexmaculata) (Fig. 2). 
DNA transposons also comprise large genomic fractions in both 
cocoon- and tube case-builders (mean of 54.4 and 32.8 Mb, respec- 
tively). However, despite containing a large number of base pairs, 
they make up a smaller fraction of total base pairs in the gnomes 
of cocoon- and tube case-builders than in fixed retreat- and net- 
spin, ners (mean genome proportion = 5.9%, 4.5%, and 11.1% in 
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Figure 2: Repeat abundance and classification in 26 caddisfly genomes. Number of bp for each repeat type is given for each caddisfly genome. A: 
Repeat abundance and classification. Phylogenetic tree was reconstructed with ASTRAL-III using single BUSCO genes from the genome assemblies. 
[he placement of Hydroptilidae (clade B1) was ambiguous. Because its placement was poorly supported in our analyses, we placed the single 
hydroptilid taxon (Agraylea sexmaculata) according to Thomas et al. [64]. Species names corresponding to the abbreviations in the tree can be found in 
[fable 1. Trichoptera are divided into two suborders: Annulipalpia ("fixed retreat- and net-spinners,” clade A: blue) and Intergripalpia (clade B: green), 
which includes basal Integripalpia (“cocoon-builders,” clades B1-B3, dark green) and Phryganides or “tube case-builders’” (clade B4: light green). 
Cocoon-builders” are divided into “purse case-building” (clade B1), “tortoise case-building” (clade B2), and “free-living” (clade B3) families. An 
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lustration of a representative of each clade is given. The “other repeats” category includes rolling circles, Penelope, low-complexity, simple repeats, and 


Zzoz AeW €z uo isenB Aq EG1LZES9/1 | oe!6/eous!osebi6/E601 OL /lop/ejole/sousiosebi6/woo' dnoolwepede//:sdy}y Wold pepeojumoq 


6 | GigaScience, 2022, Vol. 11, No. 1 


tube case—-builders, cocoon-builders, and fixed retreat- and net- 
spinners, respectively) (Fig. 2B) and thus cannot, by themselves, 
explain the larger genome sizes. SINEs, LTRs, Penelope (grouped 
with “other” repeats in Fig. 2), and satellite DNAs show a dispro- 
portionate increase in cocoon- and tube case-builders; however, 
all categories combined make up a relatively small proportion of 
their genomes (all <3% on average in Integripalpia) (Fig. 2B). Un- 
classified repeats are the most abundant repeat category across 
all Trichoptera, and they also show disproportionate expansions 
in both cocoon- and case-builders relative to fixed retreat- and 
net-spinners (Fig. 2). The general trends noted in our assembly- 
based analysis of REs were corroborated by our reference-free 
analysis of repeat abundance (Supplementary Data File $2.2, Sup- 
plementary Data File S2.3, Supplementary Figs S146 and S147, 
Supplementary Note 10). 

















TE age distribution analysis 


To test whether the observed abundance patterns of specific TEs 
are driven by shared ancient proliferation events or more re- 
cent/ongoing activity of the respective TEs, we analyzed TE age 
distribution plots. These plots allow us to visualize specific RE 
classes/superfamilies that account for shifts in RE composition 
and abundance and infer the relative timing of those shifts based 
on the distribution of sequence divergence within each RE cate- 
gory. TE age distributions showed a high abundance of recently 
diverged TE sequences in cocoon- and tube case-builders, par- 
ticularly in LINEs, DNA transposons, and LTRs in which the ma- 
jority of TEs for a given class show 0-10% sequence divergence 
within copies of a given repeat (Fig. 3). This trend was particu- 
larly pronounced among tube case-builders, with several species 
showing high abundance of LINEs and DNA transposons with 0- 
5% sequence divergence (Fig. 3). This pattern suggests that the 
observed TE expansion is due primarily to ongoing TE activity 
within lineages rather than a few shared bursts of activity in an- 
cestral lineages. This is further supported by our analysis of re- 
peat subclasses with age distribution plots (Supplementary Fig. 
S148). For example, in our study, LINE abundance is often due to 
the expansion of different LINE subclasses even between species 
in the same sub-clade (e.g., compare Lepidostoma with Micrasema, 
Himalopsyche with Glossosoma; Supplementary Fig. S148). We also 
find evidence of shared ancient bursts of SINE activity in cocoon- 
and tube case-builders, although SINEs are not an abundant re- 
peat class in any species (mean genomic proportion = 1.9% [SD 
1.7%]) (Supplementary Fig. S148). 














Associations between TE sequences and protein-coding 
genes 


During early exploration of our sequence data, we made an un- 
expected discovery that in some lineages, universal single-copy 
orthologs, or “BUSCO genes,” showed higher than expected cover- 
age depth of mapped reads in 1 or more of their sequence frag- 
ments. Further analysis showed that these high-coverage BUSCO 
sequence regions are typically RE sequences (primarily TEs) that 
are either embedded within or located immediately adjacent to 
BUSCO genes, such that the BUSCO algorithm includes them in 
its annotation of a given gene. We refer to BUSCO genes contain- 
ing these putative RE fragments as “TE-associated BUSCOs” (Sup- 
plementary Fig. S149, Supplementary Note 11). By estimating how 
many times they occur, we can quantitatively measure how TE- 
gene interactions change with changing genome size. In fact, we 
detected a positive linear relationship between TE-gene interac- 
tions and increasing genome size when measured with this ac- 
cidently discovered metric. We found major expansions of TE- 


associated BUSCOs in cocoon- and tube case-builders (Fig. 4A) 
that are significantly correlated with total repeat abundance, as 
well as the genomic proportion of LINEs and DNA transposons 
(Supplementary Fig. S150). TE-associated BUSCOs comprise a rel- 
atively large fraction of total BUSCO genes in these lineages (mean 
of 11.2% and 21.4% of total BUSCOs in cocoon- and tube case- 
builders, respectively), compared to annulipalpian lineages (mean 
= 6.2%). This finding highlights the major impact of REs on the 
composition of protein-coding genes in species with repeat-rich 
genomes. The BUSCO-associated sequences may represent TEs 
recently inserted into BUSCO genes, the remnants left behind fol- 
owing historical TE transposition events, or TE sequences that are 
immediately adjacent to and inadvertently classified as BUSCO 
sequences. 
To confirm that unexpectedly high-coverage sequence regions 
in TE-associated BUSCOs were in fact TE-derived sequences, we 
compared patterns of BUSCO gene structure (through pairwise 
alignment) across species pairs in which high-coverage regions 
Le., putative TE sequences) were present in the BUSCO gene of 
1 species (i.e., the “inflated” species) but absent in the homolo- 
gous BUSCO of the other (i.e., the “reference” species). This anal- 
ysis showed that in 73 of 75 randomly sampled alignments, refer- 
ence species showed gaps or highly non-contiguous alignments in 
high-coverage regions of the inflated species (Fig. 4B), suggesting 
that sequence insertions are typically present in high-coverage se- 
quence regions of TE-associated BUSCOs. Our subsequent BLAST 
analysis showed that comparing a TE-associated BUSCO against 
its own assembly produced thousands to millions of BLAST hits 
from many contigs (Fig. 4C). This confirmed that the indel se- 
quence present in high-coverage regions of inflated species shows 
high sequence similarity to REs elsewhere in the genome. We then 
used an intersect analysis on the BLAST results to confirm that 
the large majority of the excess BLAST hits overlap with RE an- 
notations throughout the genome, most of which are TEs, with 
LINEs and DNA transposons being most abundant (Fig. 4D, Supple- 
mentary Data File $2.5). Finally, we found that if we replaced the 
TE-associated BLAST query sequence with the homologous but 
non-TE-associated BUSCO from its counterpart reference species, 
the number of BLAST hits was fewer (Fig. 4C, Supplementary Data 
File $2.6), offering further evidence that the TE sequence inser- 
tions driving the pattern of high coverage in read mapping ex- 
cess BLAST hits are absent in reference species and thus carriable 
across relatively short time scales within Trichoptera. Taken to- 
gether, these findings provide strong evidence that TE sequences 
(especially LINEs and DNA transposons) inadvertently annotated 
by BUSCO can account for the high-coverage regions that we ob- 
serve in BUSCO genes (Fig. 4D). 

Our accidental discovery that quantifying the frequency of TE- 
associated BUSCOs can serve as an estimate of TE-gene associ- 
ations may prove useful in other systems given the wide use of 
BUSCO analysis in genomic studies. Finer details supporting the 
TE-gene association analysis are reported in Supplementary Note 
11. 
































Gene and genome duplications 

Recently, a transcriptome-based study found evidence for puta- 
tive ancient gene and genome duplications in hexapods, includ- 
ing potential WGD events in caddisflies [23], suggesting that du- 
plication events could be responsible for some genome size varia- 
tion in Trichoptera. We investigated whether this pattern persists 
with whole-genome data and found that the age distribution of 
duplications in 18 genomes was significantly different compared 
to the background rate of gene duplication (Supplementary Figs 
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Figure 3: Transposable element age distribution landscapes. Representative examples are chosen from major Trichoptera lineages. The y-axis shows 
TE abundance as a proportion of the genome (e.g., 1.0 = 1% of the genome). The x-axis shows sequence divergence relative to TE consensus sequences 
for major TE classes. TE classes with abundance skewed toward the left (i.e., low sequence divergence) are inferred to have a recent history of 
diversification relative to TE classes with right-skewed abundance. Plots were generated in dnaPipeTE. Plots for all species are shown in 
Supplementary Fig. S148. For tip labels of the phylogenetic tree see Fig. 2. 
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Figure 4: TE-BUSCO-gene associations in Trichoptera species. (A) Raw abundance of TE-associated BUSCO sequences present in the assembly of 2,442 


sequence fragment. Bottom: A typifying alignment between a TE-associated 
“reference species”) that lacks TE-association. The non-TE-associated ortho 


compare any TE-associated BUSCOs against an assembly for the same speci 
Second, when non-TE-associated orthologs of the TE-associated BUSCOs in 





BLAST hits for classified REs when TE-associated BUSCOs are compared aga 


S161 and S162). To identify whether any significant peak is consis- 
tent with a potential WGD, we used mixture modeling to identify 
peaks in these gene age distributions, which recovered no obvious 
peak consistent with an ancient WGD. To further investigate po- 
tential WGD, we used Smudgeplot [73] to visualize the haplotype 
structure and to estimate ploidy of the genomes. 

While Smudgeplot predicted most of the genomes to be diploid, 
4 genomes with rather small genome sizes (230-650 Mb) were 
predicted to be tetraploid (H. tenuis, Rhyacophila evoluta RSS1 and 
HR1, and P. elsis). However, the Genomescope? results indicate that 
these are highly homozygous samples. Low heterozygosity is a 
known confounder of Smudgeplot analyses [73] because it inflates 
the signal of duplication when compared to the low level of het- 
erozygosity. We therefore interpret these 4 putative polyploids as 
artifacts of low heterozygosity in the analysis. Moreover, in some 
cases Smudgeplot results remain unclear because the estimated 
coverage (1n) differs from the sequencing coverage, peak cover- 
age from the backmap.pl approach, and Genomescope2 coverage 
"kcov" (Supplementary Data File $1.8.) when automatically esti- 
mated from the data. The adjustment of the expected haploid cov- 
erage based on Genomescope2 kcov when running Smudgeplot 
suggests that some species might not be diploid (Hesperophylax 
magnus: octoploid, Supplementary Figs S96-S98; Micrasema longu- 
lum ML1: tripolid, Supplementary Figs Figs S116-S118; and O. al- 














BUSCOs in the OrthoDB 9 Endopterygota dataset. (B) Top: An example of a coverage depth profile of a TE-associated BUSCO gene (BUSCO 
EOGO90R02Q9 from ML1 [“inflated species”]) that shows unexpected high coverage in the second exon putatively due to the presence of an RE-derived 


BUSCO and its orthologous BUSCO from a closely related species 
ogous BUSCO shows non-contiguous alignment in regions of inflated 


coverage in the TE-associated BUSCO, consistent with the presence of an RE-derived sequence fragment in the TE-associated BUSCO that is absent in 
he reference species. (C) Summary of total bases annotated as REs obtained from each of the two BLAST searches. First, when we used BLAST to 


es, BLAST hits included megabases of annotated repeats (dark bars). 
the first search are taken from a close relative and compared against the 








inflated species using BLAST, there is a dramatic decreae in BLAST hits annotated as REs. Note log scale on the y-axis. (D) Summary of annotations for 


inst an assembly of the same species using BLAST. 


bicorne: tetraploid, Supplementary Figs Figs S127-S129. However, 
further sequencing as well as karyotyping including chromosome 
counting would need to be done to confirm polyploidy in these 
species. 


Discussion 


The drivers and evolutionary consequences of genome size evo- 
ution are a topic of ongoing debate. Several models have been 
proposed [9]. Some hypothesize genome size to be a (mal)adaptive 
trait by affecting phenotypic traits such as developmental/life his- 
tory, body size, and other cell size-related effects [59, 74-76] re- 
viewed in [9]. On the other hand, neutral theories suggest that 
DNA accumulation occurs only by genetic drift without selective 
pressures playing a major role in the accumulation or loss of DNA 
the mutational hazard hypothesis [MHH] [24] and the mutational 
equilibrium hypothesis [MEH] [25]). The MHH only allows for small 
deleterious effects for the accumulation of extra DNA, which is ac- 
companied by higher mutation rates in larger genomes [24], while 
the MEH focuses on the balance between insertions and deletions. 
It suggests that genome expansions arise by means of “bursts” of 
duplication events or TE activity and that genome shrinkage may 
be caused by a more constant rate of small deletions [25]. 
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In this study, we observe that genome size varies ~14-fold 
across the order Trichoptera, with lower genome size estimates in 
hxed retreat- and net-spinners and cocoon-builders compared to 
tube case-builders, and explore potential drivers of genome size 
evolution. Although recent genomic studies have shown evidence 
of bursts of gene duplication and gene family expansion during 
the evolution of hexapods [23, 77], the presence of ancient genome 
duplication events is still a subject of debate [78-80]. When com- 
puting haplotype structure and ploidy estimation, Smudgeplot 
suggested polyploidy in 3 species. However, karyotypes including 
chromosome counts are missing for these species because only 
very few have been reported for caddisflies in general [81-83]. We 
found no evidence of ancient WGD in the gene age distribution 
in our Trichoptera genomes, although we recognize that some of 
our current genome assemblies might be too fragmented to infer 
synteny. This does not mean that we can rule out that duplica- 
tion events played a role in genome size evolution in Trichoptera 
in the past. The emergence of Pacific Biosciences HiFi genomes of 
caddisflies (e.g., Darwin Tree of Life Project is currently planning 
to sequence several caddisfly genomes [84]) will allow a deeper 
exploration of putative ancient duplication events in Trichoptera. 
We found evidence that TE expansions (especially LINEs) were 
important drivers of genome size evolution in Trichoptera (Fig. 2, 
Supplementary Figs S146 and $147), which is consistent with MEH. 
The TE age distribution analyses suggested that the high abun- 
dance of LINEs was due to ongoing/recent activity occurring in- 
dependently across cocoon- and particularly tube case-builders 
Fig. 3, Supplementary Fig. S148). Thus, the shift to large genomes 
in these lineages does not appear to be due to a single (or a few) 
shared ancient events; rather they maintained dynamic turnover 
in composition of their large genomes. Mutational bias affecting 
pathways tied to TE regulation may affect insertion/deletion ra- 
tios and subsequently lead to lineage-specific shifts in genome 
size equilibrium [85]. Such changes may be stochastic (e.g., due 
to drift) or linked to traits that evolve on independent trajectories 
as lineages diverge and are thereby constrained by phylogeny. Eco- 
ogical factors, demographic history, and effective population size 
can further affect mutation rates. For example, environmental 
stress can trigger bursts of TE activity and elevated mutation rates 
86-88], driving lineages that occupy niche space with frequent 
exposure to environmental stress toward increased TE loads and 
arger genomes. Similarly, lineages with small effective population 
sizes or that are prone to population bottlenecks may have higher 
mutation rates and/or reduced efficacy of natural selection, which 
would otherwise purge mildly deleterious TE load. 

Although our study is not designed to pinpoint specific forces 
maintaining large genomes in some lineages, the pattern that we 
observe in the distribution of genome size (i.e., lower genome 
size estimates in fixed retreat- and net-spinners and cocoon- 
builders compared to tube case-builders) leads us to hypothe- 
size that ecological factors may play a role in genome size evo- 
ution in the order. The 3 focal groups discussed here exhibit 
markedly different ecological strategies. Larvae of fixed retreat— 
and net-spinners generally occupy relatively narrow niche space 
in oxygen-rich flowing-water (mostly stream/river) environments 
where they rely on water currents to bring food materials to their 
filter nets. The evolutionary innovation of tube-case making is 
thought to have enabled tube case-builders to occupy a much 
greater diversity of ecological niche space by allowing them to ob- 
tain oxygen in lentic (e.g., pond, lake, marsh) environments, which 
are much more variable in temperature and oxygen availability 
than lotic environments [89, 90]. This environmental instability is 
greater over short (daily, seasonal) and long time scales (centuries, 
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millennia) [91]. It is thus plausible that these tube case-building 
lineages experience greater environmental stress and less stable 
population demographics that could lead to both more frequent 
TE bursts and reduced efficacy of natural selection in purging 
deleterious effects of TE expansions as described above [24, 25]. 

We show that TE expansions (especially LINEs and DNA trans- 
posons) in cocoon- and tube case-builders have a major impact 
on protein-coding gene regions (Fig. 4). These TE-gene associations 
show a linear relationship with increasing genome size. This trend 
is particularly pronounced among tube case-builders, in which 
TE-associated BUSCOs comprise a mean of 21.4% of total BUSCO 
genes (compared with 6.2% in annulipalpians). This finding cor- 
roborates other studies highlighting the role of TEs as drivers of 
rapid genome evolution [92-95] and highlights their impact on ge- 
nomic regions that have potential effects on phenotypes. Ques- 
tions remain as to what evolutionary roles such changes in genic 
regions may play. In general, TE insertions are considered to have 
deleterious effects on their host'’s fitness activity [96, 97]. They are 
known to “interrupt” genes [34], pose a risk of ectopic recombina- 
tion that can lead to genome rearrangements [32, 35, 98], and have 
epigenetic effects on neighboring sequences [55, 99]. Therefore, 
purifying selection keeps TEs at low frequencies [34]. However, 
there is growing evidence that TE activity can also be a cnitical 
source of new genetic variation driving diversification via chromo- 
somal rearrangements and transposition events, which can result 
in mutations [100], including examples of co-option [101]; e.g., re- 
cent research in mammals has shown that DNA transposon frag- 
ments can be co-opted to form regulatory networks with genome- 
wide effects on gene expression [45]. 

Ecological correlates with genome size are widely discussed in 
other taxa [61, 102-105]. Caddisflies and other diverse insect lin- 
eages that feature various microhabitat specializations, feeding 
modes, and/or the use of silk represent evolutionary replicates 
with contrasting traits and dynamic genome size evolution. They 
thus have high potential as models for understanding links be- 
tween ecology and the evolution of REs, genomes, and phenotypes. 
Our study lays a foundation for future work in caddisflies that in- 
vestigates the potential impact of TE expansions on phenotypes 
and tests for evidence of co-option/adaptive impacts of TE-rich 
genomes against a null of neutral or slightly deleterious effects. 

















Potential implications 


Many open questions remain as to the causes and consequences 
of genome size evolution. As we move forward in an era where 
genome assemblies are attainable for historically intractable or- 
ganisms (e.g., due to constraints given large genome sizes, tis- 
sue limitations, no close reference available) we can leverage new 
model systems spanning a greater diversity of life to understand 
how genomes evolve. Here, we provide genomic resources and 
new genome size estimates across lineages of an underrepre- 
sented insect order that spans major variation in genome size. 
These data allowed us to study genome size evolution in a phy- 
logenetic framework to reveal lineage-specific patterns in which 
genome size correlates strongly with phylogeny and ecological 
characteristics within lineages. We find that large genomes dom- 
inate lineages with a wider range of ecological variation and that 
ongoing recent TE activity seems to maintain large genomes in 
these lineages. This leads us to hypothesize that ecological fac- 
tors may be linked to genome size evolution in this group. The fu- 
ture directions spawned by our findings highlight the potential for 
using Trichoptera and other diverse insect groups to understand 
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the link between ecological and genomic diversity, a link that has 
been challenging to study with past models [9]. 

We also show that TE expansions are associated with increas- 
ing genome size and have an effect on protein-coding regions. 
These effects have been greatest in the most species-rich and eco- 
logically diverse caddisfly clades. While TEs are generally consid- 
ered to have deleterious effects on their host’s fitness activity, their 
roles can also be neutral or even adaptive. TE activity can be a 
critical source of new genetic variation and thus an important 
driver for diversification. Caddisflies and potentially other non- 
model insect groups are excellent models to test these contrast- 
ing hypotheses, as well as the potential impact of TEs on pheno- 
types. Using these models, especially with respect to the increas- 
ing emergence of high-quality insect genomes [106], will allow re- 
searchers to identify recurring patterns in TE dynamics and in- 
vestigate their evolutionary implications across diverse clades. 


Methods 


DNA extraction, library preparation, sequencing, 
and sequence read processing 


We extracted high molecular weight genomic DNA (gDNA) from 
17 individuals (15 species) of caddisfly larvae (for sampling in- 
formation, see Supplementary Data File $1.1) after removing the 
intestinal tracts using a salting-out protocol adapted from [107] 
as described in Supplementary Note 1. We generated gDNA li- 
braries for a low-cost high-contiguity sequencing strategy, i.e., us- 
ing a combination of short (lumina) and long-read (Nanopore or 
Pacific Biosciences) technologies as described in Supplementary 
Note 2. For details on sequencing coverage for each specimen see 
Supplementary Data File $1.3. 


De novo genome assembly, annotation, and 
quality assessment 


We applied different assembly strategies for different datasets. 
First, we applied a long-read assembly method using wtdbg2 v2.4 
(WTDBG, RRID:SCR_017225) [108] with subsequent short-read pol- 
ishing with Pilon v1.22 (Pilon, RRID:SCR_014731) [109] because 
this method revealed good results in previous de novo assem- 
blies in caddisflies [63]. In cases where this pipeline did not meet 
the expected quality regarding contiguity and BUSCO complete- 
ness, we applied de novo hybrid assembly approaches of MaSuRCA 
v.3.1.1 (MaSuRCA, RRID:SCR_010691) [110] (Supplementary Not 
3). Ilumina-only data were assembled with SPAdes (SPAdes, RRID 

2 





SCR_000131) [111] (explained in Supplementary Note 3). Prior to 





annotating the individual genomes with MAKER2 v2.31.10 [11 





e 
113] we used RepeatModeler v2.0 (RepeatModeler, RRID:SCR_015 
027) and RepeatMasker v4.1.0 (RepeatMasker, RRID:SCR_012954) 
to identify species-specific REs in each of the assemblies, relative 
to RepBase libraries v20181026 [114]. Transcriptome evidence for 
the annotation of the individual genomes included their species- 
specific or closely related de novo transcriptome provided by 1KITE 
[115, 116] (Suplementary Data File $1.9) or downloaded from Gen- 
bank as well as the complementary DNA and protein models 
from Stenopsyche tienmushanensis [117] and Bombyx mori (AR102, 
GenBank accession ID No. GCF_000151625.1). Additional protein 
evidence included the uniprot-sprot database (downloaded 25 
September2018). We masked repeats on the basis of species- 
specific files produced by RepeatModeler. For ab initio gene predic- 
tion, species-specific AUGUSTUS gene prediction models, as well 
as B. mori SNAP gene models, were provided to MAKER. The Evi- 
denceModeler (EVidenceModeler, RRID:SCR_014659) [118] and tR- 





























NAscan [119] options in MAKER were used to produce a weighted 
consensus gene structure and to identify transfer RNA genes. 
MAKER default options were used for BLASTN (BLASTN, RRID: 
SCR_001598), BLASTX (BLASTX, RRID:SCR_001653), and TBLASTX 
TBLASTX, RRID:SCR_011823) searches. Two assemblies (Agapetus 
fuscipens GL3 and M. longulum ML1) were not annotated because of 
their low contiguity. All protein sequences were assigned putative 
names by BlastP Protein-Protein BLAST 2.2.30+ searches [120] and 
were functionally annotated using command line Blast2Go v1.3.3 
Blast2GO, RRID:SCR_005828) [121] (see Supplementary Note 4, 
Supplementary Figs S1-S30). 

We calculated assembly statistics with QUAST v5.0.2 (QUAST, 
RRID:SCR_001228) [52] and examined completeness with BUSCO 
v5 (BUSCO, RRID:SCR_015008) [53, 54] using the Endopterygota 
odb10 dataset with the options "-long, -m = genome". A sum- 
mary of the assembly statistics and BUSCO completeness is 
given in Table 1. The final genome assemblies and annota- 
tions were screened and filtered for potential contaminations 
with taxon-annotated GC-coverage (TAGC) plots using BlobTools 
v1.0 (Blobtools, RRID:SCR_017618) [122]. Details and blobplots are 
given in Supplementary Note 5 and Supplementary Figs S31-S47. 














Species tree reconstruction 


We used the single-copy orthologs resulting from a BUSCOVv3.0.2 
analysis (with the Endopterygota odb9 dataset and options -long, 
—m = genome and -sp = fly) to generate a species tree. We first 
combined single-copy ortholog amino acid files from each species 
into a single FASTA for each ortholog. We then aligned them with 
the MAFFT L-INS-i algorithm [123]. We selected amino acid sub- 
stitution models for each ortholog using ModelFinder (option -m 
mfp, [124] in IQtree v.2.0.6 [125] and estimated a maximum likeli- 
hood tree with 1,000 ultrafast bootstrap replicates [126] with the 
BNNI correction (option -bb 1000 -bnni). We combined the best 
maximum likelihood tree from each gene for species tree analy- 
sis in ASTRAL-III [127]. A locus tree was inferred using the align- 
ment file (-s) and the partition file (-S) with the settings —prefix loci 
and -T AUTO in IQtree. Gene and site concordance factors were 
calculated with IQTree using the species tree (-t), the locus tree 
(-gcf), and the alignment file (-s) with 100 quartets for comput- 
ing the site concordance factors (-scf 100) and —prefix concord for 
computing the gene concordance factors. We visualized the trees 
using FigTree v.1.4.4 (FigTree, RRID:SCR_008515). 





Genome size estimations and genome profiling 


Genome size estimates of 27 species were conducted using FCM 
according to Otto [56] using Lycopersicon esculentum cv. Stupické 
polnity¢kové rané (2C = 1.96 pg [57]) as internal standard and 
propidium iodine as stain (see Suplementary Data File $1.6). Ad- 
ditionally, we used trimmed, contamination-filtered short-read 
data (see Supplementary Note 2) to conduct genome profiling 
estimation of major genome characteristics such as size, het- 
erozygosity, and repetitiveness) using a k-mer distribution-based 
method (GenomeScope 2.0, RRID:SCR_017014) [73]. Genome scope 
profiles are available online (see links to Genomescope 2 in Su- 
plementary Data File $1.4). In addition, we applied a second 
sequencing-based method for genome size estimates, which uses 
the back-mapping rate of sequenced reads to the assembly and 
coverage distribution (backmap.pl v0.1 [65], see Suplementary 
Data File $1.5). Details of all 3 methods are described in Supple- 
mentary Note 7. Coverage distribution per position and genome 
size estimate from backmap.pl are shown in Supplementary Figs 
S49-S72. We assessed the congruence among the 3 quantita- 
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tive methods of measurement (Genomescope2, Backmap.pl, and 
FCM) with Bland-Altman-Plots using the function BlandAltman- 
Leh::bland.altman.plot in ggplot2 [60] in RStudio [128] (Supple- 
mentary Note 8, Supplementary Fig. S73). 


Repeat dynamics 
Repeat abundance and classification 


We identified and classified REs in the genome assemblies of each 
species using RepeatModeler2.0 [129]. We annotated repeats in 
the contamination-filtered assemblies with RepeatMasker 4.1.0 
(RepeatMasker, RRID:SCR_012954 using the custom repeat li- 
braries generated from RepeatModeler2 for each respective as- 
sembly with the search engine set to “ncbi’” and using the -xsmall 
option. We converted the softmasked assembly resulting from 
the first RepeatMasker round into a hardmasked assembly us- 
ing the lc2n.py script [130]. Finally, we reran RepeatMasker on 
the hard-masked genome with RepeatMasker’s internal arthro- 
pod repeat library using -species “Arthropoda.” We then merged 
RepeatMasker output tables from both runs by parsing them with 
a custom-made script (RM_table_parser_families_.py [131]) and 
then combined the resulting data columns for the two runs in Ex- 
cel. 

We also estimated RE abundance and composition us- 
ing RepeatExplorer2 [132, 133] and dnaPipeTE v.1.3.1 [134]. 
These reference-free approaches quantify repeats directly from 
unassembled short-read data. These analyses allowed us to test 
for general consistency of patterns with our assembly-based ap- 
proach described above and to test for the presence of abun- 
dant repeat categories such as satellite DNAs, which can com- 
prise large fractions of genomes yet can be prone to poor rep- 
resentation in the genome assembly. Prior to analysis, we nor- 
malized contamination-filtered (see Supplementary Note 2) in- 
put datasets to 0.5x coverage using RepeatProfiler [135] and seqtk 
136] and then ran RepeatExplorer2 clustering with the Metazoa 
3.0 database specified for annotation (Supplementary Fig. S146) 
and dnaPipeTE with the -RM_lib flag set to the Repbase v20170127 
repeat library (Supplementary Fig. S147). 




















TE age distribution analysis 


We further characterized RE dynamics in Trichoptera by analyz- 
ing TE landscapes, which show relative age differences among TE 
sequences and their genomic abundance. We used these analy- 
ses to test whether abundance patterns of specific TEs are driven 
by shared ancient proliferation events or more recent/ongoing ac- 
tivity of the respective TEs. For example, if shared ancient prolif- 
eration is driving abundance patterns of a given TE, the majority 
of its copies would show moderate to high sequence divergence 
(e.g., >10% pairwise divergence). In contrast, if abundance pat- 
terns are driven by recent/ongoing activity of a given TE, we would 
expect the majority of its sequences to show low sequence diver- 
gence (e.g., 0-10%). We generated TE age distribution plots using 
dnaPipeTE v1.3.1 [134] with genomic coverage for each species 
sampled to 0.5x prior to analysis and the -RM_lib flag set to the 
Repbase v20170127 repeat library (Supplementary Fig. S148). 








TE sequence associations with protein-coding genes 


We analyzed BUSCO genes (obtained from a BUSCOv3.0.2 analy- 
sis with the Endopterygota odb9 dataset and options —-long, —-m = 
genome and -sp = fly) for all species to quantify the abundance 
of TE-associated BUSCOs across samples and investigated associ- 
ations between TEs and genic sequences in Trichoptera lineages 
by quantifying the abundance of TE-associated BUSCO genes (for 
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presence and absence of TE-associated BUSCOs see Supplemen- 
tary Fig. S149, Supplementary Data File S2.4). This analysis also al- 
lowed us to quantify shifts in associations between TEs and genic 
regions across Trichoptera lineages with varying repeat abun- 
dance. We identified BUSCO genes with high-coverage sequence 
regions based on coverage profiles and quantified their genomic 
abundance by using each TE-associated BUSCO as a query ina 
BLAST search against their respective genome assembly. We then 
conducted intersect analysis for all unique BUSCO hits from high- 
coverage sequences to determine whether these were annotated 
as TEs. We calculated the total number of bases in filtered BLAST 
after subtracting the number of bases at the locus belonging to all 
“complete” BUSCO genes and categorized high-coverage sequence 
regions in BUSCO genes based on their annotation status and re- 
peat classification using custom scripts [131]. We plotted the num- 
ber of the high-coverage BUSCO sequence regions belonging to 
RE categories (i.e., classes and subclasses) alongside plots of the 
relative genomic abundance of each respective category. In addi- 
tion, we investigated BUSCO genes with regions of high coverage 
by pairwise alignments. Specifically, we visualized alignments of 
BUSCOs with high-coverage sequence regions (ie., the “inflated 
species”) alongside orthologous BUSCOS that lack such regions 
taken from closely related species (i.e., the “reference” species). 
We further tested this prediction by taking the set of BUSCOs that 
only exhibited high-coverage regions in the inflated species and 
contrasted results of the two BLAST searches followed by an in- 
tersect analysis. A detailed description of this method is provided 
in Supplementary Note 11. 








Gene and genome duplications 
Inference of WGDs from gene age distributions 


To recover signal from potential WGDs, for each genome, we used 
the DupPipe pipeline to construct gene families and estimate the 
age distribution of gene duplications [137, 138]. We translated 
DNA sequences and identified open reading frames (ORFs) by 
comparing the Genewise [139] alignment to the best-hit protein 
from a collection of proteins from 24 metazoan genomes from 
Metazome v3.0. For all DupPipe runs, we used protein-guided DNA 
alignments to align our nucleic acid sequences while maintain- 
ing the ORFs. We estimated synonymous divergence (K;) using 
PAML with the F3x4 model [140] for each node in the gene fam- 
ily phylogenies (Supplementary Data File $1.10). We first identi- 
fied taxa with potential WGDs by comparing their paralog ages to 
a simulated null distribution without ancient WGDs using a K-S 
goodness-of-fit test [141]. We then used mixture modeling to iden- 
tify any significant peaks consistent with a potential WGD and to 
estimate their median paralog K, values. Significant peaks were 
identified using a likelihood ratio test in the boot.comp function 
of the package mixtools in R [142]. 

















Visualization of genome structure to estimate ploidy using 
Smudgeplots 


We visualized the genome structure and estimated ploidy lev- 
els with Smudgeplot. For this purpose, we extracted genomic k- 
mers from k-mer counts produced with jellyfish (as described in 
“Genome size estimations and genome profiling”) using “jellyfish 
dump” with coverage thresholds previously estimated from k-mer 
histograms using the smudgeplot.py script. We computed the set 
of k-mer pairs with the Smudgeplot tool hetkmers. After gen- 
erating the list of k-mer pair coverages, we generated smudge- 
plots using the coverage of the k-mer pairs and the “plot” tool 
within Smudgeplot. Ploidy, as well as the haploid k-mer coverage, 
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was estimated directly from the data and compared to the esti- 
mated kcov reported by Genomescope?2, sequencing coverage (and 
sequencing-based kcov), and peak coverage from the backmap.pl 
approach (see Supplementary Data File $1.8). When the haploid k- 
mer coverage estimated by Smudgeplot was inconsistent with the 
kcov observed by Genomescope2, it was manually adjusted using 
-n in smudgeplot.py plot. Details of the method and smudgeplots 
are given in Supplementary Note 9 and Supplementary Figs S74— 
145. 





Additional Files 


Supplementary Data File $1.1. Sample information. 
Supplementary Data File $1.2. Assembly statistics. 
Supplementary Data File $1.3. Sequencing coverages. 
Supplementary Data File $1.4. GenomeScope? results. 
Supplementary Data File $1.5. Backmap.pl results. 
Supplementary Data File $1.6. Flow cytometry results. 
Supplementary Data File $1.7. Genome size summary. 
Supplementary Data File $1.8. Comparison coverage. 
Supplementary Data File $1.9. List of transcriptomes. 
Supplementary Data File $1.10. Paths to final Ks files. 
Supplementary Data File $2.1. Assembly based repeat summary. 
Supplementary Data File $2.2. RepeatExplorer summary. 
Supplementary Data File $2.3. dnapipeTE_Results. 
Supplementary Data File $2.4. TE-associated BUSCOs per 
Species. 
Supplementary Data File S2.5. Summary of intersect analysis. 
Supplementary Data File S2.6. Species pair tests. 
Supplementary Material.docx 

Supplementary Note 1. DNA extraction. 

Supplementary Note 2. Sequencing strategies. 

Supplementary Note 3. Assembly strategies. 

Supplementary Note 4. Functional annotation of protein cod- 
ing genes. 

Supplementary Note 5. Contamination filtering. 

Supplementary Note 6: Caddisfly silk usage. 

Supplementary Note 7: Genome size estimations and genome 
profiling. 

Supplementary Note 8: Bland-Altman-Plots. 

Supplementary Note 9: Visualization of genome structure to 
estimate ploidy using smudgeplots. 

Supplementary Note 10: Repeat abundance and classification 
based on reference-free analyses. 

Supplementary Note 11: TE sequence association with 
protein-coding genes. 

Supplementary Note 12: TE-associated BUSCOs 

Supplementary Figure S1. Blast2GO Annotation Results of 
Drusus annulatus. 

Supplementary Figure S2. Blast2GO Functional Annotation for 
Drusus annulatus. 

Supplementary Figure S3. Blast2GO Annotation Results of 
Agraylea sexmaculata. 

Supplementary Figure S4. Blast2GO Functional Annotation for 
Agraylea sexmaculata. 

Supplementary Figure S5. Blast2GO Annotation Results of 
Glossosoma conforme. 

Supplementary Figure S6. Blast2GO Functional Annotation for 
Glossosoma conforme. 

Supplementary Figure S7. Blast2GO Annotation Results of 
alesus radiatus. 
Supplementary Figure S8. Blast2GO Functional Annotation for 

alesus radiatus. 
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Supplementary Figure S9. Blast2GO Annotation Results of Hi- 


alopsyche phryganeae. 
Supplementary Figure S10. Blast2GO Functional Annotation 
or Himalopsyche phryganeae. 


Supplementary Figure S11. 


Lepidostoma basale. 
Supplementary Figure S12. Blast2GO Functional Annotation 
or Lepidostoma basale. 





Supplementary Figure S13. 


Micrasema longulum ML3. 


Blast2GO Annotation Results of 


Blast2GO Annotation Results of 


Supplementary Figure S14. Blast2GO Functional Annotation 


for Micrasema longulum ML3. 


Supplementary Figure S15. 


Micrasema minimum. 


Blast2GO Annotation Results of 


Supplementary Figure S16. Blast2GO Functional Annotation 


for Micrasema minimum. 


Supplementary Figure S17. 


icropterna sequax. 


Blast2GO Annotation Results of 


Supplementary Figure $18. Blast2GO Functional Annotation 


for Micropterna sequax. 


Supplementary Figure S19. 


Odontocerum albicorne. 
Supplementary Figure $20. Blast2GO Functional Annotation 





TAGC) plots of Agraylea sexmacu 


[TAGC) plots of t 
[TAGC) plots of 


[TAGC) plots of 





for Odontocerum albicorne. 


Supplementary Figure S21. 


Parapsyche elsis. 


Blast2GO Annotation Results of 


Blast2GO Annotation Results of 


Supplementary Figure S22. Blast2GO Functional Annotation 


for Parapsyche elsis. 


Supplementary Figure S23. 


Philopotamus ludiferatus. 


Blast2GO Annotation Results of 


Supplementary Figure S24. Blast2GO Functional Annotation 


for Philopotamus ludiferatus. 


Supplementary Figure S25. 


Rhyacophila brunneae. 


Blast2GO Annotation Results of 


Supplementary Figure $26. Blast2GO Functional Annotation 


for Rhyacophila brunneae. 


Supplementary Figure S27. 


Rhyacophila evoluta HR1. 


Blast2GO Annotation Results of 


Supplementary Figure $28. Blast2GO Functional Annotation 


for Rhyacophila evoluta HR1. 


Supplementary Figure S29. 


Rhyacophila evoluta RSS1. 


Blast2GO Annotation Results of 


Supplementary Figure S30. Blast2GO Functional Annotation 


for Rhyacophila evoluta RSS1. 


Supplementary Figure $31. 


Supplementary Figure S32. 


Taxon-annotated GC-coverage 
TAGC) plots of Agapetus fuscipens GL3 genome assembly. 


Taxon-annotated GC-coverage 
ata AS19 genome assembly. 





Supplementary Figure $32. 


Supplementary Figure $34.1 





Supplementary Figure $35.1 
alesus radiatus. 
Supplementary Figure S36. 











Supplementary Figure S37. 


Supplementary Figure S38. 


Lepidostoma basale. 


Taxon-annotated GC-coverage 


TAGC) plots of Drusus annulatus AC1 genome assembly. 


axon-annotated GC-coverage 


[AGC) plots of Glossosoma conforme G1. 


axon-annotated GC-coverage 


Taxon-annotated GC-coverage 


imalopsyche phryganeae. 


Taxon-annotated GC-coverage 


Taxon-annotated GC-coverage 


TAGC) plots of Micrasema longulum ML1. 





Supplementary Figure S39. 




















Taxon-annotated GC-coverage 


TAGC) plots of Micrasema longulum ML3. 
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Supplementary Figure S40. Taxon-annotated GC-coverage 
TAGC) plots of Micrasema minimum. 
Supplementary Figure S41. Taxon-annotated GC-coverage 
TAGC) plots of Micropterna sequax. 
Supplementary Figure S42. Taxon-annotated GC-coverage 
TAGC) plots of Odontocerum albicorne. 
Supplementary Figure S43. Taxon-annotated GC-coverage 
TAGC) plots of Parapsyche elsis. 
Supplementary Figure S44. Taxon-annotated GC-coverage 
TAGC) plots of Philopotamus ludiferatus. 
Supplementary Figure S45. Taxon-annotated GC-coverage 
TAGC) plots of Rhyacophila brunneae. 
Supplementary Figure S46. Taxon-annotated GC-coverage 
TAGC) plots of Rhyacophila evoluta HR1. 
Supplementary Figure S47. Taxon-annotated GC-coverage 
TAGC) plots of Rhyacophila evoluta RSS1. 
Supplementary Figure S48. Phylogenetic relationships derived 
from ASTRAL-III analyses using single BUSCO genes. 
Supplementary Figure S49. Agapetus fuscipens: Coverage dis- 
tribution per position and genome size estimate from backmap.pl. 
Supplementary Figure S50. Agraylea sexmaculata: Cover- 
age distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure $51. Agrypnia vestita. Coverage distri- 
bution per position and genome size estimate from backmap.pl. 
Supplementary Figure S52. Drusus annulatus: Coverage distri- 
bution per position and genome size estimate from backmap.pl. 
Supplementary Figure S53. Glossosoma conforme G1: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S54. Glossosoma conforme Glo: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S56. Halesus radiatus. Coverage distri- 
bution per position and genome size estimate from backmap.pl. 
Supplementary Figure S57. Hesperophylax magnus: Cover- 
age distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S58. Himalopsyche phryganeae: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S59. Lepidostoma basale: Coverage dis- 
tribution per position and genome size estimate from backmap.pl. 
Supplementary Figure S61. Micrasema longulum ML1: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S62. Micrasema longulum ML3: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S63. Micrasema minimum: Cover- 
age distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S64. Micropterna sequax: Coverage dis- 
tribution per position and genome size estimate from backmap.pl. 
Supplementary Figure S65. Odontocerum albicorne: Cover- 
age distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S66. Parapsyche elsis: Coverage distri- 
bution per position and genome size estimate from backmap.pl. 
Supplementary Figure S67. Philopotamus ludificatus: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
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Supplementary Figure S68. Rhyacophila brunnea: Cover- 
age distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S69. Rhyacophila evoluta HR1: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S70. Rhyacophila evoluta Rss1: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 
Supplementary Figure S71. Sericostoma sp.: Coverage distri- 
bution per position and genome size estimate from backmap.pl. 
Supplementary Figure S72. Stenopsyche tienhuanesis: Cov- 
erage distribution per position and genome size estimate from 
backmap.pl. 

Supplementary Figure S73. Bland-Altman-Plots to test the 
comparability of agreement between the three quantitative meth- 
ods of genome size measurement (Genomescope2, Backmap.pl 
and FCM; supplementary Note 7). 

Supplementary Figure S74. 
fuscipens GL3 on the linear scale. 

Supplementary Figure S75. 
fuscipens GL3 on the log scale. 

Supplementary Figure S76. Smudgeplot for Agraylea sexmac- 
ulata AS19 on the linear scale. 

Supplementary Figure S77. Smudgeplot for Agraylea sexmac- 
ulata AS19 on the log scale. 

Supplementary Figure S78. Smudgeplot for Agrypnia vestiva 
on the linear scale. 

Supplementary Figure S79. Smudgeplot for Agrypnia vestiva 
on the log scale. 

Supplementary Figure S80. Smudgeplot for Drusus annulatus 
on the linear scale. 

Supplementary Figure S81. Smudgeplot for Drusus annulatus 
AC1 on the log scale. 

Supplementary Figure S82. Smudgeplot for Glossosma con- 
forme G1 on the linear scale. 

Supplementary Figure S83. Smudgeplot for Glossosma con- 
forme G1 on the log scale. 

Supplementary Figure S84. Smudgeplot for Glossosma con- 
forme Glo on the linear scale. 

Supplementary Figure S85. Smudgeplot for Glossosma con- 
forme Glo on the log scale. 

Supplementary Figure S86. Smudgeplot for Halesus radiatus 
L2 on the linear scale. 

Supplementary Figure S87. Smudgeplot for Halesus radiatus 
L2 on the log scale . 

Supplementary Figure S88. Smudgeplot for Himalopsyche 
phryganeae on the linear scale. 
Supplementary Figure S89. Smudgeplot for Himalopsyche 
phryganeae on the log scale. 
Supplementary Figure S90. Smudgeplot for Himalopsyche 
phryganeae on the linear scale, 1n was manually adjusted using 
-n based on the Genomescope?2 kcov (-n=53). 
Supplementary Figure $91. Smudgeplot for Himalopsyche 
phryganeae on the log scale, In was manually adjusted using -n 
based on the Genomescope2 kcov (-n=53). 
Supplementary Figure S92. Smudgeplot for Himalopsyche 
phryganeae on the linear scale, 1n was manually adjusted using 
-n based on the kcov using the sequencing coverage (-n=61). 
Supplementary Figure S93. Smudgeplot for Himalopsyche 
phryganeae on the log scale, 1n was manually adjusted using -n 
based on the kcov using the sequencing coverage (-n=61). 

















Smudgeplot for Agapetus 


Smudgeplot for Agapetus 
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Supplementary Figure S94. Smudgeplot for Hesperophylax Supplementary Figure $118. Smudgeplot for Micrasema 
magnus on the linear scale. ongulum ML1 on the linear scale, 1n was manually adjusted us- 

Supplementary Figure S95. Smudgeplot for Hesperophylax ing -n based on the kcov using the sequencing coverage (-n=45). 
magnus on the log scale. Supplementary Figure S119. Smudgeplot for Micrasema 

Supplementary Figure S96. Smudgeplot for Hesperophylax ongulum ML1 on the log scale, In was manually adjusted using 
magnus on the linear scale, 1n was manually adjusted using -n -n based on the kcov using the sequencing coverage (-n=45). 
based on the Genomescope?2 kcov (-n=21). Supplementary Figure $120. Smudgeplot for Mcirasema min- 

Supplementary Figure S97. Smudgeplot for Hesperophylax imum KOS5 on the linear scale. 
magnus on the logscale, 1n was manually adjusted using -n based Supplementary Figure $121. Smudgeplot for Micrasema min- 
on the Genomescope?2 kcov (-n=21). imum KO5 on the log scale. 

Supplementary Figure S98. Smudgeplot for Hesperophylax Supplementary Figure $122. Smudgeplot for Micropterna se- 
magnus on the linear scale, 1n was manually adjusted using -n quax AB8 on the linear scale. 
based on the kcov using the sequencing coverage (-n=25). Supplementary Figure $123. Smudgeplot for Micropterna se- 

Supplementary Figure S99. Smudgeplot for Hesperophylax quax AB8 on the log scale. 
magnus on the logscale, 1n was manually adjusted using -n based Supplementary Figure $124. Smudgeplot for Odontocerum al- 
on the kcov using the sequencing coverage (-n=25). bicorne OD1 on the linear scale. 

Supplementary Figure $100. Smudgeplot for Hydropsyche Supplementary Figure $125. Smudgeplot for Odontocerum al- 
tenuis on the linear scale. bicorne OD1 on the linear scale. 

Supplementary Figure $101. Smudgeplot for Hydropsyche Supplementary Figure $126. Smudgeplot for Odontocerum al- 
tenuis on the log scale. bicorne OD1 on the linear scale, In was manually adjusted using 

Supplementary Figure S102. Smudgeplot for Lepidostoma -n based on the kcov using the sequencing coverage (-n=20). 
basale LB1 on the linear scale. Supplementary Figure $127. Smudgeplot for Odontocerum al- 

Supplementary Figure $103. Smudgeplot for Lepidostoma bicorne OD1 on the log scale, 1n was manually adjusted using -n 
basale LB1 on the log scale. based on the kcov using the sequencing coverage (-n=20). 

Supplementary Figure $104. Smudgeplot for Lepidostoma Supplementary Figure $128. Smudgeplot for Odontocerum al- 
basale on the linear scale, 1n was manually adjusted using -n bicorne OD1 on the linear scale, In was manually adjusted using 
based on the Genomescope?2 kcov (-n=36). -n based on the Genomescope? kcov (-n=25). 

Supplementary Figure $105. Smudgeplot for Lepidostoma Supplementary Figure $129. Smudgeplot for Odontocerum al- 
basale on the log scale, In was manually adjusted using -n based bicorne OD1 on the log scale, 1n was manually adjusted using -n 
on the Genomescope?2 kcov (-n=36). based on the Genomescope?2 kcov (-n=25). 

Supplementary Figure S106. Smudgeplot for Lepidostoma Supplementary Figure $130. Smudgeplot for Parapsyche elsis 
basale on the linear scale, 1n was manually adjusted using -n on the linear scale. 
based on the kcov using the sequencing coverage (-n=48). Supplementary Figure $131. Smudgeplot for Parapsyche elsis 

Supplementary Figure S107. Smudgeplot for Lepidostoma on the linear scale. 
basale on the log scale, 1n was manually adjusted using -n based Supplementary Figure S132. Smudgeplot for Philopotamus 
on the kcov using the sequencing coverage (-n=48). ludificatus Ph2 on the linear scale. 

Supplementary Figure S108. Smudgeplot for Micrasema Supplementary Figure $133. Smudgeplot for Philopotamus 

ongulum ML3 on the linear scale. ludificatus Ph2 on the log scale. 

Supplementary Figure S109. Smudgeplot for Micrasema Supplementary Figure $134. Smudgeplot for Plectrocnemia 

ongulum ML3 on the log scale. conspersa on the linear scale. 

Supplementary Figure $110. Smudgeplot for Micrasema Supplementary Figure $135. Smudgeplot for Plectrocnemia 

ongulum ML3 on the linear scale, 1n was manually adjusted us- conspersa on the log scale. 
ing -n based on the Genomescope2 kcov (-n=43). Supplementary Figure S136. Smudgeplot for Rhyacophila 
Supplementary Figure S111. Smudgeplot for Micrasema brunneae on the linear scale. 
ongulum ML3 on the log scale, In was manually adjusted using Supplementary Figure $137. Smudgeplot for Rhyacophila 
-n based on the Genomescope?2 kcov (-n=43). brunneae on the log scale. 
Supplementary Figure $112. Smudgeplot for Micrasema Supplementary Figure $138. Smudgeplot for Rhyacophila evo- 
ongulum ML3 on the linear scale, 1n was manually adjusted us- uta HR1 on the linear scale. 
ing -n based on the kcov using the sequencing coverage (-n=48). Supplementary Figure $139. Smudgeplot for Rhyacophila evo- 
Supplementary Figure $113. Smudgeplot for Micrasema uta HR1 on the log scale. 
ongulum ML3 on the log scale, 1n was manually adjusted using Supplementary Figure $140. Smudgeplot for Rhyacophila evo- 
-n based on the kcov using the sequencing coverage (-n=48). uta RSS1 on the linear scale. 

Supplementary Figure $114. Smudgeplot for Micrasema Supplementary Figure $141. Smudgeplot for Rhyacophila evo- 

ongulum ML1 on the linear scale. uta RSS1 on the log scale. 

Supplementary Figure S115. Smudgeplot for Micrasema Supplementary Figure $142. Smudgeplot for Sericostoma sp. 

ongulum ML1 on the log scale. on the linear scale. 

Supplementary Figure S116. Smudgeplot for Micrasema Supplementary Figure $143. Smudgeplot for Sericostoma sp. 

ongulum ML1 on the linear scale, 1n was manually adjusted us- on the log scale. 

ing -n based on the Genomescope2 kcov (-n=39). Supplementary Figure $144. Smudgeplot for Stenopsyche on 
Supplementary Figure S117. Smudgeplot for Micrasema the linear scale. 

ongulum ML1 on the log scale, 1n was manually adjusted using Supplementary Figure $145. Smudgeplot for Stenopsyche on 

-n based on the Genomescope?2 kcov (-n=39). the log scale. 
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Supplementary Figure S146. Repeat abundance summary 
from Repeat-Explorer2 . 
Supplementary Figure S147. Repeat abundance summary 
from dnaPipeTE. 
Supplementary Figure $148. Transposable element age distri- 
bution landscapes. 
Supplementary Figure $149. Presence and absence of TE- 
associated BUSCOs. 
Supplementary Figure $150. Correlations between bases in TE- 
associated BUSCO BLAST hits and genomic abundance of repeat 
categories. 
Supplementary Figure $151. BUSCO EOGO90ROA/C in IGV. 
Supplementary Figure $152. BUSCO EOGO90R0A26 in IGV. 
Supplementary Figure S153. BUSCO EOGO90ROAIP in IGV. 
Supplementary Figure $154. BUSCO EOGO90ROAIP in IGV. 
Supplementary Figure $155. BUSCO EOGO90ROBAL in IGV. 
Supplementary Figure S156. BUSCO EOGO90ROBV8 in IGV. 
Supplementary Figure $157. BUSCO EOGO90ROD3M in IGV. 
Supplementary Figure S158. BUSCO EOGO90RODSK in IGV. 
Supplementary Figure S159. BUSCO EOGO90RODJA in IGV. 
Supplementary Figure S160. BUSCO EOGO90RODOF in IGV. 
Supplementary Figure $161. Inference of WGDs from gene age 
distributions KS2. 
Supplementary Figure $162. Inference of WGDs from gene age 
distributions KS5 
Table S1: Ten BUSCOs of Hesperophylax magnus, their location 
in the genome and the start and end of the highly covered region. 
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